On-The-Fly  Detection  Of  Access  Anomalies 

by 

Edith  Schonberg 

Ultracomputer  Note  #149 
October,  1988 


Ultracomputer  Research  Laboratory 


New  York  University 

Courant  Institute  of  Mathematical  Sciences 

Division  of  Computer  Science 

251  Mercer  Street,  New  York,  NY  10012 




This work was supported under the Applied Mathematical Sciences Program of the U.S. Department of Energy
under contract DE-FG02-88ER25052.


ABSTRACT 

Access anomalies are a common class of bugs in shared-memory parallel programs.
An access anomaly occurs when two concurrent execution threads both write (or
one thread reads and the other writes) the same shared memory location.
Approaches to the detection of access anomalies include static analysis,
post-mortem trace analysis, and on-the-fly monitoring.

A general on-the-fly algorithm for access anomaly detection is presented,
which can be applied to programs with both nested fork-join and synchronization
operations. The advantage of on-the-fly detection over post-mortem analysis
is that the amount of storage used can be greatly reduced by data compression
techniques and by discarding information as soon as it becomes obsolete. In
the algorithm presented, the amount of storage required at any time depends
only on the number V of shared variables being monitored and the number N of
threads, not on the number of synchronizations. Data compression is achieved
by the use of two techniques called merging and subtraction. Upper bounds on
storage are shown to be V x N^2 for merging and V x N for subtraction.


1.    Introduction 

In shared-memory parallel programs, a common and vexing class of parallel bugs arises
from access anomalies. By shared-memory parallel programs, we mean programs in which
concurrent execution streams directly read and write shared memory. An access anomaly
occurs when two concurrent execution streams both write (or one reads and the other writes)
the same memory location. Access anomalies are almost always bugs.¹ They result either from
a violation of data dependences or from miscellaneous program logic errors, such as errors in
subscripting or addressing. In Ada, access anomalies render a program erroneous, that is, the
program has undefined semantics.

To illustrate an access anomaly, consider the small code sequence below, written in Fortran
extended by a parallel doall:


¹ An exception to this is chaotic relaxation [Bau].


doall i = 1,2
   A[i] = B[i] + 1
   doall j = 1,2
      C[i,j] = A[i+1] + B[j]
   endall
endall

The dynamic parallel flow of control for this program is shown in Figure 1, where each
separate execution stream is labeled by an identifier Ti. Since location A[2] is written by
execution stream T2 and read by execution streams T3 and T4, there is an access anomaly. (In
this case, the doall does not respect some data dependence in the original sequential program.)

Access anomalies are often difficult to locate using conventional debugging techniques
because they are sources of non-determinism. Different executions of the same program may
lead to different orderings in the code sections containing the anomaly. For example, in
Figure 1, location A[2] in T3 may or may not be accessed before the assignment in T2, depending
on relative execution speeds; thus the value of C[1,1] may vary from run to run.
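The conflict in this example can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the read and write sets per stream are transcribed by hand from the doall code, the list of concurrent pairs is written out manually (T3 and T4 are taken to be the children of T1, as in the text), and the inner streams spawned by T2 are omitted. The names `accesses`, `concurrent_pairs`, and `anomalies` are hypothetical.

```python
# Hypothetical encoding of the example: stream -> (locations read, locations written).
accesses = {
    "T1": ({"B[1]"}, {"A[1]"}),              # outer iteration i = 1
    "T2": ({"B[2]"}, {"A[2]"}),              # outer iteration i = 2
    "T3": ({"A[2]", "B[1]"}, {"C[1,1]"}),    # inner iteration j = 1 of i = 1
    "T4": ({"A[2]", "B[2]"}, {"C[1,2]"}),    # inner iteration j = 2 of i = 1
}

# Assumed concurrency: T1 runs alongside T2, and T1's children run alongside
# T2 and each other. T1 itself is blocked while T3 and T4 execute.
concurrent_pairs = [("T1", "T2"), ("T2", "T3"), ("T2", "T4"), ("T3", "T4")]


def anomalies(accesses, pairs):
    """Report (stream, stream, location) for write/write and read/write conflicts."""
    found = []
    for a, b in pairs:
        reads_a, writes_a = accesses[a]
        reads_b, writes_b = accesses[b]
        conflict = (writes_a & writes_b) | (writes_a & reads_b) | (reads_a & writes_b)
        found.extend((a, b, loc) for loc in sorted(conflict))
    return found


found = anomalies(accesses, concurrent_pairs)
for a, b, loc in found:
    print(f"anomaly on {loc} between {a} and {b}")
```

Two concurrent reads of A[2] by T3 and T4 are not reported, since an anomaly requires at least one write.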

In  this  paper  we  present  an  algorithm  for  detecting  access  anomalies  for  a  general  model 
of  parallel  programs.  Detection  is  performed  on-the-fly,  and  the  algorithm  has  the  property 
that  the  space  needed  to  perform  the  analysis  depends  only  on  the  number  of  shared  variables 
and the number of parallel execution threads, not on the number of synchronizations or
communications among the parallel threads. The algorithm therefore offers a practical solution to
the  anomaly  detection  problem  for  large  and  long-running  programs. 

In the rest of this section, we discuss other work that has been done in this area. Section
2 presents a formal framework for our algorithm. A general high-level algorithm is presented
in Section 3 for programs with pairwise synchronization. The storage properties of the
algorithm are derived from two data compression techniques: the first, called merging, is
described in Section 4, and the second, called subtraction, is described in Section 5. We prove
that the amount of storage used is bounded by V x N^2/2 for merging and V x N for
subtraction, where V is the number of shared variables monitored and N is the number of execution
threads. Section 6 extends these results to programs with nested fork-join operations, and
Section 7 shows how to apply the algorithm to a variety of programming language constructs,
including the Ada rendezvous, barrier synchronization, lock/unlock operations, and message
passing primitives. Conclusions are drawn in Section 8.

1.1.    Related Work

A  variety  of  approaches  have  been  explored  for  detecting  access  anomalies.  One 
approach  is  static  analysis  [App,  Car,  Tay],  the  goal  of  which  is  to  detect  all  potential  access 
anomalies prior to execution. In [App] and [Tay], sections of code which are potentially
concurrent are identified; shared variables read and written in these sections are potential
anomalies.  To  reduce  the  set  of  potential  anomalies  found,  further  static  analysis  is  required, 
such  as  subscript  range  and  alias  analysis,  as  well  as  data  dependence  analysis  [Bur,  Car].  At 
best,  static  analysis  can  only  indicate  a  conservative  superset  of  potential  anomalies.  Faced 
with  too  many  false  reports,  it  is  easy  for  a  programmer  to  disregard  the  advice  of  an  overly 
conservative  static  tool. 

Dynamic  access  anomaly  detection,  on  the  other  hand,  is  a  complementary  approach  that 
can  guarantee  that  there  are  no  anomalies  in  any  particular  execution  instance  of  a  program. 
Dynamic detection tests whether a potential anomaly revealed by static methods is a real
anomaly, while static analysis is useful to reduce the number of variables that have to be
monitored at run-time. In [Mil], the dynamic access anomaly detection problem is defined for a
variety  of  parallel  programming  constructs,  but  no  efficient  algorithm  is  presented. 

The approach of [All] combines static analysis with dynamic detection. A history trace is
generated during program execution, and the trace is analysed in a post-mortem phase to
detect anomalies. Static analysis is used to reduce the number of variables that have to be
checked at run-time, in order to compress the dynamic trace, and to locate potential access
anomalies that may be hidden by earlier access anomalies. The drawback of methods based on
history traces and post-mortem analysis is that traces readily grow too large, even when
compression techniques are used. In [All], a record must be stored for each basic block
executed and additionally for each array operation.

On-the-fly  detection  is  a  more  promising  dynamic  approach,  in  which  anomalies  are 
found  while  the  program  is  executing,  rather  than  in  a  post-mortem  phase.  Additional  storage 
is  used  to  save  shared  variable  access  information  necessary  to  perform  the  analysis,  but  the 
size  of  this  storage  is  much  less  than  is  required  by  trace-based  methods. 

Variants  of  on-the-fly  algorithms  are  given  in  [Nud,  Sni].  In  both  of  these  algorithms,  a 
solution is presented only for a limited shared-memory programming model in which parallelism
is expressed using a nestable doall-endall construct; the model does not include primitives
for communication and synchronization among parallel threads.

The  on-the-fly  anomaly  detection  algorithm  presented  here  is  more  general  than  both 
those  of  [Nud]  and  [Sni].  The  algorithm  of  [Sni]  is  in  fact  a  special  case  of  this  more  general 
algorithm,  and  so  we  first  summarize  [Sni]  below. 

1.2.    On-The-Fly Anomaly Detection For Doall-Endall Programs

In the anomaly detection algorithm of [Sni], each execution stream created by a doall has
its own display. The display holds a read flag and a write flag for each location that is being
monitored. If a monitored location is read (written) in an execution stream T, the read (write)
flag is set in the display of T. If T is a child of Q, and if this is the first read (write) of the
monitored location within T, the corresponding bit must also be set in the display of Q (and
in that of the parent of Q, and so on). On subsequent reads (writes) to the same location in T,
no further updating is required. For example, in Figure 1, T3 is a child of T1. If location A[2] is
accessed for the first time in T3, the displays of both T3 and T1 must be updated.

The set of concurrent streams that start at a given doall terminate together at a
corresponding endall; at termination time, displays are compared and any access anomalies are
revealed. In our example, the anomaly involving A[2] is revealed when streams T1 and T2
terminate (see Figure 2). The only additional storage needed for this algorithm is a bitstring
display for each current execution stream.

A shortcoming of this method is that when an anomaly is detected, the exact source
location of the statements that caused the anomaly is not known. However, since the address of
the variable involved and the sections of code where the anomaly occurs are known, it is
possible to perform more detailed subsequent monitoring to precisely locate the anomaly.
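The display mechanism summarized above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the implementation of [Sni]: displays are modeled as Python sets rather than bitstrings, and the class `Stream`, the function `endall`, and the root stream `T0` are hypothetical names introduced here.

```python
class Stream:
    """An execution stream with a display of read and write flags."""

    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.reads, self.writes = set(), set()

    def _mark(self, loc, kind):
        # Set the flag here and in every ancestor. If the flag is already set
        # in some display, it is also set in all displays above it, so stop.
        node = self
        while node is not None:
            flags = node.reads if kind == "r" else node.writes
            if loc in flags:
                break
            flags.add(loc)
            node = node.parent

    def read(self, loc):
        self._mark(loc, "r")

    def write(self, loc):
        self._mark(loc, "w")


def endall(siblings):
    """Compare the displays of streams that terminate together."""
    found = []
    for i, a in enumerate(siblings):
        for b in siblings[i + 1:]:
            bad = (a.writes & b.writes) | (a.writes & b.reads) | (a.reads & b.writes)
            found += [(a.name, b.name, loc) for loc in sorted(bad)]
    return found


# The doall example of Section 1, with T3 and T4 as children of T1.
root = Stream("T0")
t1, t2 = Stream("T1", root), Stream("T2", root)
t1.read("B[1]"); t1.write("A[1]")
t2.read("B[2]"); t2.write("A[2]")
t3, t4 = Stream("T3", t1), Stream("T4", t1)
t3.read("A[2]"); t3.read("B[1]"); t3.write("C[1,1]")
t4.read("A[2]"); t4.read("B[2]"); t4.write("C[1,2]")

inner = endall([t3, t4])   # inner endall: no conflict between T3 and T4
outer = endall([t1, t2])   # outer endall: T1's display carries T3's read of A[2]
```

As in the text, the conflict on A[2] surfaces only at the outer endall, and the report names T1 rather than the child stream that performed the read.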

The  more  general  problem  that  we  solve  is  formalized  in  the  sections  that  follow. 

2.    Programming  Model  and  Problem  Definition 

Given  that  parallel  programs  are  often  inherently  non-deterministic,  the  access  anomaly 
detection  algorithm  presented  below  applies  only  to  a  single  execution  instance  of  a  program. 
An  execution  instance  of  a  parallel  program  P  is  written  as  P. 

We  refer  to  the  multiple  execution  threads  of  control  in  a  parallel  program  execution  P  as 
tasks.  Tasks  are  created  and  destroyed  via  a  closed,  nestable  fork-join  construct.  One  or  more 
tasks  are  created  at  a  fork  statement.  Each  fork  statement  has  a  corresponding  join  statement, 
and  all  tasks  created  by  a  specific  fork  statement  terminate  together  at  the  corresponding  join. 
The  parent  task  that  executes  the  fork  is  blocked  until  all  its  children  terminate. 

Additionally, there is a primitive for pairwise synchronous task coordination. Synchronous
coordination is a symmetric operation. If two tasks coordinate synchronously, then neither
task can execute past the coordination point until the other task has reached the
coordination point.

Below  we  refer  to  the  fork,  join,  and  coordination  primitive  as  parallel  operations,  and  the 
sequence  of  program  statements  executed  in  a  single  task  between  two  parallel  operations  is 
referred  to  as  a  sequential  block.  A  task  is  a  series  of  sequential  blocks. 

Let P be an execution instance of a parallel program P, and let Bx and By be sequential
blocks within tasks Ti and Tj of P respectively. Block By is said to be dependent on Bx if the
sequence of operations among the tasks in P forces the execution of By to occur after Bx,
regardless of the relative execution speeds of the tasks Ti and Tj. Dependences are determined
by the particular parallel operations used, and are more formally specified as follows.

By executing in Tj is dependent on Bx executing in Ti iff any of the following holds:

(i) Ti is the same task as Tj, and Bx executes before By;

(ii) Block By begins with

• a fork operation, and Ti is the parent of Tj,

• a join operation, and Ti is one of the tasks executing the join, or

• a synchronous coordination operation between Ti and Tj, and Bx ends before the
coordination point; or

(iii) There is another sequential block instance Bz in task Tk such that By is dependent on Bz
and Bz is dependent on Bx.

If block By is not dependent on Bx and Bx is not dependent on By, blocks Bx and By are
concurrent.
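The definition can be read as a transitive closure over direct dependences. The Python sketch below is a minimal illustration under assumptions: the direct dependences of rules (i) and (ii) are supplied by hand as pairs, rule (iii) is applied by iterating to a fixed point, and the four-block scenario (two tasks that coordinate once) is a hypothetical example, not one of the paper's figures.

```python
from itertools import combinations


def concurrent_pairs(blocks, direct):
    """Return the pairs of blocks with no dependence in either direction."""
    # after[x] holds every block known to be dependent on x.
    after = {b: set() for b in blocks}
    for x, y in direct:              # y is directly dependent on x
        after[x].add(y)
    changed = True
    while changed:                   # rule (iii): close the relation transitively
        changed = False
        for x in blocks:
            extra = set().union(*(after[y] for y in after[x])) - after[x]
            if extra:
                after[x] |= extra
                changed = True
    return [(a, b) for a, b in combinations(blocks, 2)
            if b not in after[a] and a not in after[b]]


# Hypothetical scenario: task Ti executes B1 then B2, task Tj executes B3 then B4,
# and the two tasks coordinate once, at the end of B1 and B3.
blocks = ["B1", "B2", "B3", "B4"]
direct = [("B1", "B2"), ("B3", "B4"),    # rule (i): same task, earlier block
          ("B1", "B4"), ("B3", "B2")]    # rule (ii): coordination point
pairs = concurrent_pairs(blocks, direct)
print(pairs)
```

Only the blocks on opposite sides of the coordination point that execute at the same logical stage remain concurrent: B1 with B3, and B2 with B4.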

As examples, in Figure 3, B2 is dependent on B1, and concurrent with B4, B5, B6, B7, and
B8. The block B3 is dependent on B1, B2, and B4, and concurrent with all of the other
sequential blocks.

Figure 4(a) illustrates synchronous coordination, in which vertical lines represent tasks,
and horizontal lines represent coordination points. In this figure, sequential block B4 is
dependent on B1 and B3. Block B5 is dependent on B1, B3, B4, B6, B7, and B9. Figure 4(b) indicates
for each sequential block the set of other sequential blocks concurrent with it.

An  access  anomaly  occurs  when  at  least  two  sequential  blocks  are  concurrent,  and  either 
each  block  writes  to  the  same  shared  memory  location,  or  one  block  writes  and  the  other  reads 
the same location. An on-the-fly access anomaly detection algorithm finds anomalies in a
program while the program executes. To accomplish this, the algorithm must be able to
determine the following information for each sequential block B:

(1)  the  shared  variables  read  and  written  in  B,  and 

(2)  the  set  of  sequential  blocks  that  are  concurrent  with  B. 

While  any  given  sequential  block  is  executing,  each  block  concurrent  with  it  may  have  already 
finished  executing,  be  currently  executing,  or  not  have  commenced  execution;  it  is  necessary  to 
save  the  information  (1)  and  (2)  until  it  is  known  that  all  concurrent  blocks  have  completed. 

Our goal in constructing an efficient algorithm is to minimize the amount of additional
storage  needed  at  any  time.  In  particular,  if  the  amount  of  storage  required  is  not  significantly 
less  than  the  size  of  the  trace  data  needed  for  post-mortem  analysis,  there  is  no  advantage  to 
on-the-fly  analysis. 

For clarity of exposition, we restrict attention in Sections 3-5 to programs with synchronous
coordination operations only. In Section 3, a general high-level algorithm is specified;
compression  techniques  are  presented  in  Sections  4  and  5.  Section  6  shows  how  to  extend  the 
algorithm  for  programs  with  fork-join  operations  also. 

3.    General  Algorithm 

For  an  execution  instance  P  with  m  synchronous  coordination  operations,  we  choose  a 
total  ordering  of  the  operations  that  is  consistent  with  the  partial  ordering  in  P,  and  present  a 
general anomaly detection algorithm by specifying the actions performed at each timestep
t = 1, 2, ..., m.

At  time  t,  we  consider  every  sequential  block  Bx  that  is  currently  executing,  starting  to 
execute, or has finished executing, and associate with each such block a shared variable set
Sx(t). Sx(t) consists of two subsets: the shared variables read in Bx and the shared variables
written  in  Bx  by  time  t.  Two  shared  variable  sets  are  said  to  be  concurrent  if  the  sequential 
blocks  associated  with  them  are  concurrent. 

If  block  Bx  has  finished  executing  before  or  at  time  t,  Sx(t)  is  said  to  be  complete.  If  a 
block Bx is currently executing at time t, or just beginning to execute, Sx(t) is said to be
incomplete. Assuming N tasks at time t, there are N blocks currently executing, and N associated
incomplete  shared  variable  sets.  The  N  incomplete  shared  variable  sets  are  all  concurrent  with 
each  other. 

To illustrate these definitions, in Figure 5, each coordination point is labeled with a
timestep. At time (2), the shared variable sets for blocks B1, B4, B5, and B8 are complete; the
shared variable sets for blocks B2, B6, B9, and B11 are incomplete; and no other shared variable
sets exist.

The  algorithm,  presented  below,  detects  anomalies  by  comparing  complete  concurrent 
shared  variable  sets  at  each  timestep.  At  time  t,  each  newly  completed  shared  variable  set 
Sx(t) is compared with other completed shared variable sets that are concurrent with it.
However, not all blocks concurrent with Bx are complete at time t.

To keep track of concurrent shared variable sets that will complete at a later time, we
associate a concurrency list with each shared variable set. The concurrency list for a shared
variable set Sx(t) is the set of all tasks T such that the block B currently executing (or about to
begin executing) in T at time t is concurrent with Bx. The concurrency list associated with
Sx(t) is written as CLx(t).

A  concurrency  list  decreases  in  size  over  time  as  concurrent  blocks  finish  executing.  If  a 
concurrency  list  CLx(t)  becomes  empty,  the  shared  variable  set  Sx  has  been  compared  with  all 
shared  variable  sets  concurrent  with  it;  thus,  Sx  can  be  deleted. 

We trace the above steps for blocks B1 and B2 in Figure 5. Initially, there are four
incomplete shared variable sets, with associated concurrency lists CL1(0) = {T2, T3, T4},
CL4(0) = {T1, T3, T4}, CL8(0) = {T1, T2, T4}, and CL11(0) = {T1, T2, T3}. At time (1), B1
and B4 are finished, and so S1(1) is compared with S4(1). However, there are other blocks
executing, or not yet started, whose shared variable sets must be compared with S1, namely B8
and B11, but the comparison can only be made at a later time. The concurrency list CL1(1) for
shared variable set S1(1) is {T3, T4}, since after (1), the executing blocks B8 and B11 are
concurrent with B1, but the executing blocks B2 and B5 are not concurrent with B1. The
concurrency list for the new incomplete shared variable set S2(1) is {T2, T3, T4}.

At time (2), B8 finishes, so S1(2) can be compared with S8(2). Since B1 is not concurrent
with B9, CL1(2) becomes {T4}. On the other hand, CL2(2) is the same as CL2(1).

At time (3), B2 finishes, and is compared with B5, B6, and B8. The concurrency list
CL2(3) is {T3, T4}. The concurrency list CL1(3) equals CL1(2).

Finally at time (4), S1(4) is compared with S11(4). All blocks concurrent with B1 have
finished executing, the concurrency list CL1(4) is empty, and so S1(4) is deleted. The
concurrency list CL2(4) has not changed.

Note that the size of a concurrency list can be at most N-1, and so there are at most 2^N
distinct concurrency lists. A concurrency list of size N-1 is always associated with an
incomplete shared variable set Sx(t); CLx(t) is {Tj : j ≠ i}, where Bx is the current block in task Ti.
Concurrency lists associated with complete shared variable sets at time t are always of size less
than N-1.

This  algorithm  is  described  more  precisely  below. 

Algorithm. At each timestep t, suppose a synchronous coordination operation is executed
by block Bx in task Ti and block By in task Tj. The following four steps are performed:

(1) Compare. To check for anomalies, compare Sx(t) with each shared variable set Sz(t),
provided that Sz(t) is complete and Ti is in the concurrency list CLz(t-1); compare
Sy(t) with each complete shared variable set Sz(t), provided that Tj is in the
concurrency list CLz(t-1).

(2) Initialize. For each new sequential block Bz that begins executing at t, a new
incomplete shared variable set Sz(t) and concurrency list CLz(t) are generated.

(3) Update. Update all concurrency lists associated with shared variable sets that are
complete at time t, to reflect the transition from time t-1 to t. This step is specified more
precisely below.

(4) Delete. Delete any shared variable sets whose concurrency lists have become empty as
a result of step (3).

The  concurrency  list  update  rule  (step  (3))  is  derived  from  the  following  lemma. 

Lemma 1. Suppose block Bx in task Ti and By in task Tj coordinate synchronously at time t.
Let Bx' be the new block executing in Ti after the coordination point. Let Bz be another block
that is complete at time t. If Bx is concurrent with Bz and By is not concurrent with Bz, then
the new block Bx' is also not concurrent with Bz. Conversely, if Bx and By are both concurrent
with Bz, and Ti and Tj coordinate, then the new incomplete blocks Bx' and By' executing after
the coordination are also concurrent with Bz.

Proof. Since By is not concurrent with Bz, By depends on the block Bz. (Bz cannot be
dependent on By, since Bz has finished executing first.) Bx' is dependent on By, and so it is also
dependent on Bz. The converse also follows immediately from the definition of dependency
and concurrency. □

The update rule for synchronous coordination operations is therefore specified as follows:

(3a) Update. For each complete shared variable set Sz(t):

(i) if Ti is in CLz(t-1) and Tj is not, CLz(t) = CLz(t-1) - {Ti}; similarly,

(ii) if Tj is in CLz(t-1) and Ti is not, CLz(t) = CLz(t-1) - {Tj};

(iii) otherwise, CLz(t) = CLz(t-1).

Note  that  updating  concurrency  lists  involves  at  most  a  linear  scan  over  all  lists.  It  is  not 
necessary  to  know  the  history  of  synchronous  coordination  operations  to  perform  the  update 
step. 
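The four steps and update rule (3a) can be put together in a short Python sketch. This is an illustrative reading under several assumptions: shared variable sets are plain Python sets, the two blocks that finish at a coordination point are compared with each other (each task is in the other block's concurrency list at time t-1), and block names are supplied by the caller. The class `Detector` and the block names B3, B7, B10, and B12 are hypothetical; the coordination pattern is inferred from the walk-through of Figure 5.

```python
class Detector:
    """Sketch of the general algorithm for pairwise synchronous coordination."""

    def __init__(self, labels):
        # labels: task -> block names in execution order (assumed input format).
        self.names = {t: iter(seq) for t, seq in labels.items()}
        self.tasks = set(labels)
        self.current = {t: self._new_block(t) for t in labels}
        self.complete = []     # finished blocks with a nonempty concurrency list
        self.anomalies = []    # (block, block, variable)

    def _new_block(self, task):
        return {"name": next(self.names[task]), "r": set(), "w": set(), "cl": None}

    def read(self, task, var):
        self.current[task]["r"].add(var)

    def write(self, task, var):
        self.current[task]["w"].add(var)

    def _compare(self, a, b):
        bad = (a["w"] & b["w"]) | (a["w"] & b["r"]) | (a["r"] & b["w"])
        self.anomalies += [(a["name"], b["name"], v) for v in sorted(bad)]

    def coordinate(self, ti, tj):
        x, y = self.current[ti], self.current[tj]
        # (1) Compare against complete sets whose lists mention Ti or Tj.
        for z in self.complete:
            if ti in z["cl"]:
                self._compare(x, z)
            if tj in z["cl"]:
                self._compare(y, z)
        self._compare(x, y)
        # (2) Initialize the blocks that start after the coordination point.
        self.current[ti], self.current[tj] = self._new_block(ti), self._new_block(tj)
        # (3) Update, rule (3a).
        for z in self.complete:
            if ti in z["cl"] and tj not in z["cl"]:
                z["cl"].discard(ti)
            elif tj in z["cl"] and ti not in z["cl"]:
                z["cl"].discard(tj)
        for blk in (x, y):     # newly complete: every task except Ti and Tj
            blk["cl"] = self.tasks - {ti, tj}
            self.complete.append(blk)
        # (4) Delete sets whose concurrency lists are empty.
        self.complete = [z for z in self.complete if z["cl"]]


d = Detector({"T1": ["B1", "B2", "B3"], "T2": ["B4", "B5", "B6", "B7"],
              "T3": ["B8", "B9", "B10"], "T4": ["B11", "B12"]})
d.write("T1", "x")    # a write in B1
d.read("T4", "x")     # a concurrent read in B11
for pair in [("T1", "T2"), ("T2", "T3"), ("T1", "T2"), ("T3", "T4")]:
    d.coordinate(*pair)
remaining = {z["name"]: z["cl"] for z in d.complete}
```

The sketch does not handle program termination or fork-join; blocks still executing when the run ends are never compared.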

If  shared  variable  sets  are  never  deleted,  this  algorithm  is  not  significantly  better  than 
post-mortem  analysis.  Figure  5  illustrates  when  the  deletion  step  can  be  applied,  namely,  after 
time  (4)  when  all  tasks  have  synchronized  with  each  other.  Overall,  however,  the  delete  step 
can  only  be  applied  under  limited  circumstances.  If  there  is  no  N-way  task  synchronization 
pattern such as that shown in Figure 5, no shared variable sets are ever deleted.
Improvements derive from merging, which reduces the total number of shared variable sets, and
subtraction, which reduces the size of each shared variable set.

4.    Refinement I: Shared Variable Set Merging

In this version, shared variable sets are represented as bit strings, so that all shared
variable sets are the same size and proportional to the number of monitored variables. A new
merge  step  is  introduced  to  reduce  the  number  of  shared  variable  sets,  as  follows: 

Shared variable sets are combined by a set union operation at time t iff their associated
concurrency lists are equal.

We  note  that  a  bitstring  representation  is  convenient  both  for  performing  the  comparison  step 
described  in  Section  3  and  for  the  set  union  operation  used  in  this  merge  step. 

With  merging,  all  concurrency  lists  are  unique,  and  so  the  number  of  shared  variable  sets 
is  clearly  bounded  by  the  number  of  possible  concurrency  lists.  While  the  space  of  possible 
concurrency lists is exponential, there is in fact a much better O(N^2) bound on the number of
distinct  concurrency  lists,  where  N  is  the  number  of  tasks.  This  result  follows  from  Lemma  2: 

Lemma 2. Consider a task T at time t and the sequential blocks B1, ..., Bn in T, such that
B1 is the first block executed, Bn is the current block executing, and Bj+1 is executed after Bj.
For 1 ≤ k < n, the concurrency list for block Bk is either equal to or a proper subset of the
concurrency list for block Bk+1.

Proof. Suppose not. Then there is an executing block Bj in another task that is not concurrent
with Bk+1, but is concurrent with Bk. It must be the case that Bj is dependent on Bk+1, since
Bj is executing and Bk+1 is finished executing. (Bk+1 cannot be Bn since Bn is concurrent with
all other executing blocks.) If Bj is dependent on Bk+1, it must also be dependent on Bk,
contradicting the assumption that Bj is concurrent with Bk. □

Since the size of the concurrency list associated with the currently executing block for
each task T is N-1, the total number of distinct concurrency lists associated with blocks in T is
at most N-1. It follows that the total number of concurrency lists at any time is bounded by N^2.
Therefore,  for  any  program,  regardless  of  the  pattern  of  synchronous  coordination  operations, 
the  amount  of  storage  required  for  shared  variable  sets  is  bounded  and  depends  only  on  the 
number  of  tasks  and  the  number  of  monitored  variables. 

We  make  the  following  changes  to  the  formulation  in  Section  3: 

(a) A shared variable set is associated with one or more sequential blocks, and consists of
two subsets: the set of all shared variables read in any of the associated blocks, and the
set of all shared variables written in any of the associated blocks. A shared variable set
associated with blocks Bx1, ..., Bxj at time t is written as S_{x1,...,xj}(t).

(b) We add the following additional algorithm step:

(5) Merge. For any two complete shared variable sets S_{x1,...,xj}(t) and S_{y1,...,yk}(t) such
that their associated concurrency lists CL_{x1,...,xj}(t) and CL_{y1,...,yk}(t) are equal, replace
S_{x1,...,xj}(t) and S_{y1,...,yk}(t) by a new merged shared variable set S_{x1,...,xj,y1,...,yk}(t).
The merged set is formed by taking the union of the subsets of S_{x1,...,xj}(t) and
S_{y1,...,yk}(t); that is, S_{x1,...,xj,y1,...,yk}(t) consists of the set of variables read in any of
Bx1, ..., Bxj or By1, ..., Byk and the set of variables written in any of Bx1, ..., Bxj or
By1, ..., Byk.

Merging  is  justified  for  the  following  reasons.  Suppose  two  shared  variable  sets  Sx(t)  and 
Sy(t)  have  equal  concurrency  lists.  If  there  is  no  merging,  the  concurrency  list  for  these  shared 
variable  sets  will  remain  equal  through  any  subsequent  operation  executed  in  the  program. 
This  result  follows  from  the  update  rule  given  in  Section  3.  Any  shared  variable  set  Sz(t'), 
t' > t, subsequently compared with Sx(t') will also be compared with Sy(t'), and vice versa.
Therefore, any error detected without merging will also be detected with merging.

4.1.   Applications

We return to Figure 5, which shows four tasks T1, ..., T4 executing concurrently. The
shared variable sets and their concurrency lists after each timestep are shown in Figure 6.
Shared variable sets that are labeled by a "*" are incomplete. The concurrency list

{Ti1, Ti2, ..., Tir}

is abbreviated as:

{i1, i2, ..., ir}.

In this example, there are 4 incomplete shared variable sets before any coordinations. At
time (1), two shared variable sets complete and are merged to form shared variable set S1,4. At
time (2), task T3 is removed from concurrency list CL1,4, because block B9 is not concurrent
with B1 and B4. Merged shared variable set S5,8 is formed. At time (3), S2,6 is a new merged
shared variable set, and task T1 is removed from concurrency list CL5,8. The concurrency lists
CL1,4 and CL5,8 are now equal, so there is an additional merging of shared variable sets S1,4
and S5,8 at this point. At time (4), the concurrency list of this merged shared variable set
S1,4,5,8 becomes empty, and so the shared variable set is deleted.
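The bookkeeping in this example can be reproduced with a small Python sketch. It is a simplified illustration under assumptions: only the concurrency lists and the block labels of merged sets are tracked (the read and write subsets and the compare step are left out), tasks and blocks are identified by integers, the coordination pattern is inferred from the walk-through of Figure 5, and the block numbers 3, 7, 10, and 12 are hypothetical. The function name `run_with_merging` is introduced here for illustration.

```python
def run_with_merging(labels, coordinations):
    """Track complete (merged) sets, keyed by concurrency list, after each timestep."""
    tasks = frozenset(labels)
    names = {t: iter(seq) for t, seq in labels.items()}
    current = {t: next(names[t]) for t in labels}
    complete = {}      # frozenset of tasks -> set of merged block numbers
    history = []
    for ti, tj in coordinations:
        updated = {}
        for cl, blocks in complete.items():
            # Update rule (3a).
            if ti in cl and tj not in cl:
                cl = cl - {ti}
            elif tj in cl and ti not in cl:
                cl = cl - {tj}
            # Delete (4) when the list is empty; otherwise Merge (5): sets
            # whose lists have become equal share one dictionary entry.
            if cl:
                updated.setdefault(cl, set()).update(blocks)
        # The two blocks that just finished have equal lists and merge at once.
        new_cl = tasks - {ti, tj}
        if new_cl:
            updated.setdefault(new_cl, set()).update({current[ti], current[tj]})
        current[ti], current[tj] = next(names[ti]), next(names[tj])
        complete = updated
        history.append({tuple(sorted(cl)): sorted(b) for cl, b in complete.items()})
    return history


labels = {1: [1, 2, 3], 2: [4, 5, 6, 7], 3: [8, 9, 10], 4: [11, 12]}
history = run_with_merging(labels, [(1, 2), (2, 3), (1, 2), (3, 4)])
for t, state in enumerate(history, start=1):
    print(t, state)
```

Keying the dictionary by concurrency list makes the merge step implicit: two sets can never coexist with the same list, which is the property the storage bound relies on.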

In general, assuming N tasks, every time a coordination operation occurs at time t
between two tasks Ti and Tj, two shared variable sets Sx(t) and Sy(t) are completed. (Two new
shared variable sets S*_{x_new}(t) and S*_{y_new}(t) are also created, since tasks Ti and Tj
continue to execute.) The concurrency lists CLx(t) and CLy(t) are of size N-2 and are equal:

CLx(t) = CLy(t) = {Tk : k ≠ i and k ≠ j}.

Therefore, Sx(t) and Sy(t) are merged into Sxy(t). At least one merge operation is always
performed at each timestep, so that the number of shared variable sets increases by at most one
per coordination operation. Moreover, since each complete shared variable set is a merger of
at least two shared variable sets, a more precise bound on the number of shared variable sets
is N^2/2.

In fact, the worst case coordination pattern that has been constructed has N^2/4 + N - 1
shared variable sets²; the construction for an execution with eight tasks is shown in Figure 7.
In this figure, there are 15 coordinations. The values of concurrency lists at time t = 14 appear
over the horizontal lines representing synchronous coordination points. Each concurrency list
belongs to the two blocks that executed the operation. Any subsequent operation at t = 16
causes a shared variable set to be deleted or merged.

4.2.   General  Comments 

While the worst case storage is quadratic, simulations with randomly generated coordination
patterns show that this worst case is not easy to obtain. A simulation with 100 tasks
synchronizing randomly 100,000 times produces a maximum number of 502 shared variable sets,
which is achieved after 51,828 coordination operations. Testing the algorithm on real programs
will give a better understanding of the storage requirements for typical programs.
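
As a worked comparison of the figures quoted above, the following small Python sketch evaluates
the N^2/2 upper bound and the N^2/4 + N - 1 size of the worst known construction. The function
names are illustrative only, and integer division is used, which is exact for even N.

```python
def merge_upper_bound(n):
    """Upper bound N^2/2 on the number of shared variable sets."""
    return n * n // 2


def worst_known_construction(n):
    """Size N^2/4 + N - 1 of the worst coordination pattern constructed so far."""
    return n * n // 4 + n - 1


for n in (8, 100):
    print(n, worst_known_construction(n), merge_upper_bound(n))
```

For the eight-task construction this gives 23 sets against a bound of 32. For 100 tasks it gives
2599 against a bound of 5000, so the observed maximum of 502 sets in the random simulation is
well below both.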

Merging loses diagnostic precision. If an access anomaly is detected in a merged shared
variable set, it is not necessarily obvious in which block one (or more) of the conflicting
reference(s) occurred. As mentioned in Section 1.2, since one of the erroneous sequential
blocks and the exact variable involved in the anomaly are revealed by the algorithm, subsequent
monitoring can be performed to precisely pinpoint the anomaly.

The algorithm of Snir described in Section 1.2 is a special case of this refined version for
programs with only a fork-join construct. (See Section 6 for a discussion of fork-join update
rules.) At each join operation, the child shared variable sets are merged with the parent shared
variable set; the delete step is never applicable. Concurrency lists need not be kept explicitly
in [Sni] because of the semantics of fork-join.

5.    Refinement  II:  Shared  Variable  Set  Subtraction 

As an alternative compression technique, the size of individual shared variable sets can be
reduced. Variables are removed from shared variable sets by a new subtraction step, as follows:

A monitored variable v is removed from the read (resp. write) subset of S_y(t) iff v is in the
read (resp. write) subset of S_x(t) and CL_y(t) is a subset of CL_x(t).


Footnote 2: This construction is due to Peter Frankl [Fr].

We justify subtraction as follows. Suppose CL_y(t) is a subset of CL_x(t). Subsequently,
any shared variable set S_z(t'), t' > t, compared with S_y(t') must also be compared with S_x(t'):
from the update rule in Section 3, CL_y remains a subset of CL_x after any subsequent
coordination operation. Therefore, if v is a variable in the read (resp. write) subset of both
S_x(t') and S_y(t'), and if there is an access anomaly involving v between S_y(t') and S_z(t'),
there is also an access anomaly between S_x(t') and S_z(t'). After subtraction, the latter
anomaly is still detected.
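
The subtraction rule can be sketched in a few lines of Python. This is a minimal illustration
under assumed names (SharedVariableSet, subtract); the representation of a shared variable set
as two Python sets plus a frozenset concurrency list is an assumption, not something specified
in this note.

```python
from dataclasses import dataclass, field


@dataclass
class SharedVariableSet:
    cl: frozenset                                # concurrency list CL(t)
    read: set = field(default_factory=set)       # variables read in the block(s)
    write: set = field(default_factory=set)      # variables written in the block(s)


def subtract(s_x, s_y):
    """Remove from s_y every variable already recorded in s_x.

    The step applies only when CL_y is a subset of CL_x; the function
    returns True if it applied and False otherwise.
    """
    if not s_y.cl <= s_x.cl:
        return False
    s_y.read -= s_x.read       # read subset shrinks by the reads s_x covers
    s_y.write -= s_x.write     # write subset shrinks by the writes s_x covers
    return True
```

If the two concurrency lists are equal, the subset test holds in both directions; in that case
the merge rule of Section 4 is the more natural step, and the sketch leaves that choice to the
caller.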

Consider the shared variable sets S_{x_1}, ..., S_{x_n} that are associated with a task T at
time t. If the subtraction step is performed, it follows from Lemma 2 that v can belong to the
read subset of only one shared variable set S_{x_i}(t), 1 ≤ i ≤ n. Therefore, the total space
needed to store read information is bounded by V x N, where V is the number of variables being
monitored.

Even less space is needed for write subsets: the storage for write information is bounded
by V. More specifically, each variable v belongs to the write subset of at most one
shared variable set. This follows from the argument below.

Suppose v is written by block B_x in task T_i at time t, and suppose v is in the write subset
of S_y(t). If B_y is concurrent with B_x, there is an access anomaly. If B_y is not concurrent
with B_x, then CL_y(t) is a subset of CL_x(t). Therefore, v is removed from S_y(t) by the
subtraction step.

With the subtraction technique, a different implementation strategy is possible: shared
variable set data are not kept per se, but rather information is distributed among the
monitored variables. For each variable v, we associate a list of references to concurrency lists
CL_x such that v is read in B_x. Because of subtraction, each list is bounded by N. We also
associate with v a reference to the concurrency list CL_x such that the most recent write
operation was performed in B_x. Every time a monitored variable v is read or written, a test is
made for a possible anomaly, and the information associated with v is updated. The compare step
(1) in the algorithm in Section 3 is thereby eliminated. (Operations on concurrency lists
described in steps (2)-(4) remain unchanged.)
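
The per-variable strategy can be sketched as follows. This is a hedged illustration, not the
implementation of this note: the class VariableMonitor, its method names, and the shared map
from block ids to concurrency lists are all assumptions. In particular, the sketch assumes that
an access by a task conflicts with an earlier access exactly when that task is still in the
concurrency list of the block that made the earlier access, and that the concurrency lists are
kept up to date elsewhere by steps (2)-(4).

```python
class VariableMonitor:
    """Per-variable bookkeeping for one monitored variable v."""

    def __init__(self, concurrency_lists):
        # Hypothetical shared map: block id -> current concurrency list (a set
        # of task ids), maintained elsewhere by the update steps (2)-(4).
        self.cl = concurrency_lists
        self.readers = []      # blocks whose reads of v are still relevant
        self.writer = None     # block that performed the most recent write

    def _conflicts(self, task, blocks):
        # An earlier access conflicts if the accessing task is still listed
        # as concurrent with the block that made it.
        return [b for b in blocks if task in self.cl[b]]

    def on_read(self, task, block):
        """Record a read of v; return the blocks it conflicts with."""
        earlier = [self.writer] if self.writer is not None else []
        found = self._conflicts(task, earlier)
        # Subtraction: drop readers whose list is a subset of the new one.
        self.readers = [b for b in self.readers
                        if not self.cl[b] <= self.cl[block]]
        self.readers.append(block)
        return found

    def on_write(self, task, block):
        """Record a write of v; return the blocks it conflicts with."""
        earlier = list(self.readers)
        if self.writer is not None:
            earlier.append(self.writer)
        found = self._conflicts(task, list(dict.fromkeys(earlier)))
        self.writer = block
        return found
```

Because a reader is dropped as soon as its concurrency list is contained in that of a newer
reader, the read list never holds two entries that subtraction would collapse, which is the
property behind the bound of N entries per variable.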

In Figure 8, variable v is read in blocks B_1, B_2, and B_5 and written in block B_11. The
second read of v in block B_2 supersedes the read of v in block B_1, so by subtraction, the
concurrency list CL_1 is replaced by CL_2 in the read list of v. After the second synchronization
between tasks T_1 and T_2, CL_5 is removed from the read list of v. Finally, at the write
operation in B_11, T_4 is in the concurrency list CL_2(t), so an anomaly will be detected.

While version II is O(N) and version I is O(N^2) in space, version II may in fact be slower
because it may be more costly to implement subtraction. Moreover, the space needed for
concurrency lists is not bounded. It is possible to combine version I and version II to obtain a
hybrid algorithm in several ways:

(1)  A merge step can be added to version II, whereby concurrency lists are combined
     according to the rule described in Section 4 (but no actual shared variable sets are
     maintained). Instead of shared variable sets, references to concurrency lists are
     associated with each monitored variable. A union-find approach [AHU] can then be applied
     to manage these reference lists for all monitored variables v.

(2)  A subtraction step can be added to version I, whereby a shared variable set difference
     operation is performed after each merge operation. In this case, the shared variable set
     representation must be adapted to exploit sparse sets.

Any combination of these approaches results in more storage compression than either
approach alone.
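
For hybrid (1), one way the union-find idea might look is sketched below. The class name
ConcurrencyListForest and the use of string ids for concurrency lists are assumptions made for
illustration; the note only says that a union-find approach can be applied, not how.

```python
class ConcurrencyListForest:
    """Union-find over concurrency-list ids (path compression, union by size)."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def add(self, cl_id):
        """Register a new concurrency list id as its own representative."""
        self.parent.setdefault(cl_id, cl_id)
        self.size.setdefault(cl_id, 1)

    def find(self, cl_id):
        """Return the representative id of the merged list containing cl_id."""
        root = cl_id
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[cl_id] != root:       # path compression
            nxt = self.parent[cl_id]
            self.parent[cl_id] = root
            cl_id = nxt
        return root

    def union(self, a, b):
        """Merge two lists that have become equal; return the representative."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:       # attach the smaller tree
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```

A monitored variable would then store list ids rather than lists. When two concurrency lists
become equal they are united once, and every reference held by any variable resolves to the same
representative through find, without visiting the variables themselves.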

6.   Adding Fork-Join Operations

We next consider programs with both synchronous coordination and nested fork-join
operations. To extend the algorithm presented in Section 3, we specify concurrency list update
rules for these operations:

Fork. Suppose task T_i executes a fork, generating new tasks T_{i_1}, ..., T_{i_n} at time t.

(3b)  Update. For each complete shared variable set S_z(t):

(i)  if CL_z(t-1) includes T_i, CL_z(t) = (CL_z(t-1) - {T_i}) ∪ {T_{i_1}, ..., T_{i_n}};

(ii)  otherwise, CL_z(t) = CL_z(t-1).

Since concurrency lists contain only tasks that are currently running, and since task T_i is
blocked until the subsequent join operation, this rule replaces references to the task T_i by
references to the forked children tasks. The parent T_i may possibly be added again at the
corresponding join operation.

Join. The join operation is similar to a multiway synchronous coordination point.
Therefore, if the tasks T_{i_1}, ..., T_{i_n} created by T_i execute a join together, the update
operation is similar to the synchronous coordination rule:

(3c)  Update. For each complete shared variable set S_z(t):

(i)  If all of T_{i_1}, ..., T_{i_n} are in the concurrency list CL_z(t-1),
     CL_z(t) = (CL_z(t-1) - {T_{i_1}, ..., T_{i_n}}) ∪ {T_i};

(ii)  otherwise, CL_z(t) = CL_z(t-1) - {T_{i_1}, ..., T_{i_n}}.
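
The fork and join update rules (3b) and (3c) translate almost directly into set operations. The
Python sketch below is only an illustration; the function names and the use of frozenset for a
concurrency list are assumptions.

```python
def fork_update(cl, parent, children):
    """Rule (3b): replace the blocked parent by its forked children."""
    if parent in cl:
        return (cl - {parent}) | frozenset(children)
    return cl


def join_update(cl, parent, children):
    """Rule (3c): children leave; the parent returns only if all were listed."""
    children = frozenset(children)
    if children <= cl:
        return (cl - children) | {parent}
    return cl - children


cl = frozenset({"T1", "T2"})
cl = fork_update(cl, "T1", ["T5", "T6"])     # T1 is replaced by T5 and T6
cl = join_update(cl, "T1", ["T5", "T6"])     # both children listed: T1 returns
```

Each function is applied to the concurrency list of every complete shared variable set when the
corresponding operation occurs.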

These rules are illustrated in the next section.

6.1.  Fork-Join Example

Consider the program graph in Figure 9(a). The concurrency lists at each parallel operation
step are shown in Figure 9(b). Initially, there are four incomplete shared variable sets.
At time (1), three new tasks are created, and shared variable set S_1 completes. At time (2),
after the first coordination between tasks T_6 and T_3, the concurrency lists for S_4 and S_10
are equal and so these sets are merged. Task T_3 is removed from the concurrency list of shared
variable set S_1, as well as from the concurrency list CL_{4,10}. After the second coordination
between tasks T_6 and T_2, T_2 is removed from the concurrency lists for shared variable sets
S_1 and S_{4,10}. Finally, at the first join point, since tasks T_5, T_6, and T_7 terminate,
they are removed from all concurrency lists. Shared variable sets S_1 and S_{4,10} have equal
concurrency lists and are merged. The final result is shown at time (4).

6.2.  Bounds on the Number of Distinct Concurrency Lists with Nested Fork-Joins

Given a program execution instance P with both synchronous coordination operations
and nested fork-join operations, the number of tasks executing varies over time. We will say
that two tasks T_i and T_j are concurrently executable if for any sequential block B_x in T_i
and B_y in T_j, B_x and B_y are concurrent. A covering of P is an assignment AS, the domain of
which is the set of tasks in P, such that if T_i and T_j are concurrently executable, then
AS(T_i) and AS(T_j) are not equal. In Figure 1, there is a covering of cardinality 4, and in
Figure 8(a), there is a covering of cardinality 6. Let M be the cardinality of a minimum
covering for P. Then the upper bound on the number of shared variable sets that can be obtained
at any time is M^2/2. The proof of this bound is given in the Appendix, and it is obtained using
Lemma 2. Below, we remark on how big M can be.

Let N be the maximum concurrency of P, where the maximum concurrency is defined as
the size of the largest set of tasks that are concurrently executable. It is clear that M ≥ N.

Let W be the maximum width of P, where the maximum width of an execution P is
defined as follows:

(i)    The maximum width of a task T that does not execute any fork operations is 1.

(ii)   The maximum width of a fork operation F is the sum of the maximum widths of all tasks
       created by the fork.

(iii)  For a task T that is a parent task, let F be the fork operation executed by T that has
       the largest maximum width. The maximum width of T is the maximum width of F.

Finally, the maximum width W of P is the sum of the maximum widths of all tasks initially
executing in P. It is clear that W is greater than or equal to the maximum concurrency
N. (In Figure 8(a), the maximum concurrency equals the maximum width.) Furthermore, it is
not hard to generate a covering for P of size W. Therefore, N ≤ M ≤ W. We conjecture,
although no proof is currently provided, that in fact M = N.
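
The recursive definition of maximum width in rules (i)-(iii) can be written down directly. The
Python sketch below assumes a hypothetical Task record in which each fork is simply the list of
tasks it creates; neither the record nor the function names come from this note.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Each fork executed by this task is the list of tasks it creates.
    forks: list = field(default_factory=list)


def max_width(task):
    """Rules (i)-(iii): 1 for a task without forks, else the widest fork,
    where the width of a fork is the sum of the widths of its children."""
    if not task.forks:
        return 1
    return max(sum(max_width(child) for child in fork) for fork in task.forks)


def execution_width(initial_tasks):
    """W: the sum of the maximum widths of the initially executing tasks."""
    return sum(max_width(t) for t in initial_tasks)
```

For example, a task that forks three leaf tasks and later forks two tasks, one of which itself
forks three leaves, has maximum width max(3, 1 + 3) = 4; together with one other initial task
that never forks, the execution has W = 5.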

7.   Applying  The  Algorithm  To  Real  Language  Constructs 

While the programming model that we have described is somewhat simplified compared
with real parallel programming languages, the algorithm presented can easily be adapted to
handle many common parallel programming language synchronization and coordination
primitives. We consider several language primitives below.

7.1.  Ada  Rendezvous 

The Ada rendezvous [DOD] is a pairwise synchronous coordination primitive which is
asymmetric. When a rendezvous occurs, the caller waits while the called task executes the
body of the accept statement. When the called task finishes the accept statement, the caller
continues. The called task can engage in other rendezvous before returning.

To use our algorithm, a rendezvous is modeled as two coordinations between the caller
task T_i and the called task T_j. The beginning of the rendezvous occurs at time t, and the end
of the rendezvous occurs at a later time t'. There is no block B_x executed in the caller T_i
between t and t', so the associated shared variable set is empty. Figure 9 illustrates a program
execution instance with five tasks and three rendezvous. First, tasks T_1 and T_2 rendezvous,
where T_2 is the caller. Then tasks T_3 and T_4 rendezvous. Before T_4 returns, T_4 and T_5
rendezvous, where T_4 is the caller. This execution is treated as a program with six synchronous
coordination operations.
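
The modeling of a rendezvous as two coordinations can be sketched as a small expansion routine.
The tuple format (caller, callee, nested), where nested lists the rendezvous that the called task
engages in before returning, is an assumption made for this illustration.

```python
def expand(rendezvous_list):
    """Expand rendezvous into the pairwise coordinations that model them."""
    ops = []
    for caller, callee, nested in rendezvous_list:
        ops.append((caller, callee))      # beginning of the rendezvous, time t
        ops.extend(expand(nested))        # rendezvous made before returning
        ops.append((caller, callee))      # end of the rendezvous, time t'
    return ops


# The five-task example from the text: T2 calls T1; T3 calls T4, and T4
# calls T5 before returning to T3.
example = [("T2", "T1", []), ("T3", "T4", [("T4", "T5", [])])]
coordinations = expand(example)
```

The three rendezvous of the example expand to six coordinations, with the two coordinations of
the T_4/T_5 rendezvous nested between the beginning and the end of the T_3/T_4 rendezvous.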

7.2.  Barrier  Synchronization 

A barrier synchronization is a multiway synchronous coordination point. More
specifically, if a task reaches an N-way barrier synchronization point during execution, it waits
until N-1 other tasks also reach the barrier. For the anomaly detection algorithm, an N-way
barrier synchronization is easily handled as at most 2 x (N-1) pairwise synchronous coordination
operations. More specifically, a barrier executed by tasks T_1, ..., T_N is treated as the
following sequence of synchronous coordinations among the tasks:

    T_1 and T_2
    T_2 and T_3
    ...
    T_{N-1} and T_N
    T_{N-1} and T_{N-2}
    ...
    T_3 and T_2
    T_2 and T_1.

The  barrier  synchronization  update  rule  is  similar  to  the  join  operation  rule. 
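
The sequence of pairwise coordinations listed above can be generated mechanically. The following
Python sketch is an illustration only; the function name barrier_pairs is an assumption.

```python
def barrier_pairs(tasks):
    """Pairwise coordinations for a barrier: up the chain, then back down."""
    # T_1 and T_2, T_2 and T_3, ..., T_{N-1} and T_N
    up = [(tasks[k], tasks[k + 1]) for k in range(len(tasks) - 1)]
    # T_{N-1} and T_{N-2}, ..., T_3 and T_2, T_2 and T_1
    down = [(tasks[k], tasks[k - 1]) for k in range(len(tasks) - 2, 0, -1)]
    return up + down


pairs = barrier_pairs(["T1", "T2", "T3", "T4"])
```

The upward sweep lets the last task learn that every task has arrived, and the downward sweep
propagates that fact back to the first task. As listed, the sequence contains 2N - 3 pairs,
which is within the 2 x (N-1) figure quoted in the text.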

7.3.   Asynchronous  Coordination 

Many coordination primitives, such as doacross coordination, message send/receive, and
locking operations, cannot be modeled by synchronous coordination operations, because they
are inherently asynchronous. An asynchronous pairwise coordination is an asymmetric
coordination between two tasks: a sender and a receiver. If the receiver reaches the
coordination point first, the receiver waits until the sender reaches the coordination point.
However, if the sender arrives first, the sender does not wait for the receiver. A message is
posted to notify the receiver to proceed executing.

Asynchronous  coordination  is  illustrated  in  Figure  10(a).  Arrows  indicate  asynchronous 
coordination  operations,  where  the  arrow  originates  at  the  sender.  Unlike  the  synchronous 
case,  the  block  executing  in  the  sender  after  a  coordination  occurs  is  concurrent  with  the  block 
executing  in  the  receiver  before  the  coordination  occurs.  Figure  10(b)  indicates  which  blocks 
in  Figure  10(a)  are  concurrent  with  each  other. 

It should be possible to extend the algorithm in Section 3 to handle asynchronous
coordination, although this is beyond the scope of this paper. With an extended algorithm, a
variety of asynchronous programming language primitives can be handled. For example, for locking
operations, the task executing the unlock operation is considered to be the sender, and the task
executing the lock operation is the receiver. Other primitives are treated in a similar fashion.

8.    Conclusion 

The anomaly detection algorithm presented provides the basis for a monitoring tool that
automates an important part of the iterative debugging process. When integrated with other
parallel debugging tools, such as a trace-and-replay system (e.g. [Mil], [LeB]) which provides
reproducibility, and static analysis which reduces the number of variables that need to be
monitored, the effectiveness of the method will be further enhanced. Alternatively, this
algorithm can be used to analyze shared variable trace data in a post-mortem phase. For certain
real-time programs, it is not possible to monitor on-the-fly, and the techniques described are
also beneficial for processing large traces subsequent to execution.

Experience with these methods is required to determine appropriate data structures,
additional algorithm refinement, and other profitable compression techniques. The design and
engineering of practical monitoring tools for finding parallel bugs in shared memory programs
is our eventual goal.

APPENDIX

We prove an upper bound on the number of shared variable sets obtainable for the
general case of programs with both fork-join and synchronous coordination operations.

Lemma 3. Let P be an execution instance, and let M be the cardinality of a minimum
covering for P. The number of shared variable sets that can be obtained at any time is
bounded by M^2/2.

Proof. Consider two program execution instances P and Q. We will say that Q simulates
P iff

(i)   for every operation O in P, there are one (or more) operations in Q corresponding to O;
      and

(ii)  if t_1, ..., t_m is a total ordering of the operations in P consistent with the partial
      ordering of operations in P, and CL_P(t) is a concurrency list in P at time t, then for
      the corresponding total ordering of operations in Q, there is a unique concurrency list
      CL_Q(t) in Q corresponding to CL_P(t).

To prove Lemma 3, we show that for an execution instance P that includes fork, join, and
synchronous coordination operations, there is an execution instance Q that simulates P, such
that Q consists of M tasks executing only synchronous coordination operations (and no nested
fork-join operations). It then follows from Lemma 2 that the number of shared variable sets
in P at any time is bounded by M^2/2.

Let R_1, ..., R_M be a set of tasks not in P, and let AS be an assignment of tasks in P to
R_1, ..., R_M, such that AS is a covering for P. The simulation, which contains only
synchronous coordinations, is performed as follows:

(i)    For each synchronous coordination operation between T_i and T_j in P, the corresponding
       operation in Q is a synchronous coordination operation between R_{i'} and R_{j'}, where
       R_{i'} = AS(T_i) and R_{j'} = AS(T_j).

(ii)   For each fork operation, such that T_i creates tasks T_{i_1}, ..., T_{i_n}, the
       corresponding operations in Q are the set of synchronous coordinations among the assigned
       tasks AS(T_i), AS(T_{i_1}), ..., AS(T_{i_n}) that implements a barrier synchronization
       among these tasks (see Section 7.2).

(iii)  For each join operation executed by tasks T_{i_1}, ..., T_{i_n}, where T_i is the task
       executing after the join, the corresponding operations in Q are again the set of
       synchronous coordination operations among the assigned tasks AS(T_i), AS(T_{i_1}), ...,
       AS(T_{i_n}) implementing a barrier synchronization.

Let t_1, ..., t_m be a total ordering of operations in P consistent with the partial ordering
of operations in P. (For simplicity of exposition, if an operation performed in P at time t
corresponds to a set of operations in Q, the set of operations in Q are considered all to be
performed at time t.)
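
The construction of Q from P in rules (i)-(iii) can be sketched as a translation of operation
lists. The Python below is a minimal illustration under stated assumptions: the operation tuples
("sync", T_i, T_j), ("fork", parent, children) and ("join", parent, children), the function
names, and the dictionary form of the assignment AS are all hypothetical. The sketch also assumes
that when a parent and a child are assigned to the same task of Q, that task takes part in the
barrier only once; the text does not spell out this case.

```python
def barrier_pairs(tasks):
    """Pairwise coordinations for a barrier: up the chain, then back down."""
    up = [(tasks[k], tasks[k + 1]) for k in range(len(tasks) - 1)]
    down = [(tasks[k], tasks[k - 1]) for k in range(len(tasks) - 2, 0, -1)]
    return up + down


def simulate(operations, assignment):
    """Translate the operations of P into synchronous coordinations of Q."""
    q_ops = []
    for op in operations:
        kind = op[0]
        if kind == "sync":                        # rule (i)
            _, ti, tj = op
            q_ops.append((assignment[ti], assignment[tj]))
        elif kind in ("fork", "join"):            # rules (ii) and (iii)
            _, parent, children = op
            group = [assignment[parent]] + [assignment[c] for c in children]
            # Assumption: a task of Q that stands for both the parent and a
            # child joins the barrier once, so duplicates are removed.
            group = list(dict.fromkeys(group))
            q_ops.extend(barrier_pairs(group))
        else:
            raise ValueError(f"unknown operation: {kind}")
    return q_ops
```

For instance, with AS mapping T1 and T5 to R1, T2 to R2 and T6 to R3, a coordination between T1
and T2 becomes a coordination between R1 and R2, and a fork of T5 and T6 by T1 becomes a barrier
among R1 and R3.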

Claim 1. For each concurrency list CL_P(t), there is a corresponding concurrency list
CL_Q(t) such that for each T in CL_P(t), AS(T) is in CL_Q(t).

At time t = 0, the claim clearly is true.

Assuming the hypothesis is true at time t, we consider concurrency lists at time t+1. For
each new concurrency list CL_P(t+1), associated with a new block executing in task T, we let
the corresponding concurrency list CL_Q(t+1) be the concurrency list associated with the new
block in task AS(T).

Suppose CL_P(t) is associated with a complete shared variable set S_x(t), and CL_Q(t) is
associated with S_y(t) in Q. Concurrency lists CL_P(t) and CL_Q(t) are updated according to the
operation performed at t+1:

Case 1. The operation at time t+1 is a synchronous coordination. Applying the update
rule to both concurrency lists CL_P(t) and CL_Q(t), it is clear that T is removed from CL_P(t)
iff AS(T) is removed from CL_Q(t). Therefore, the claim holds at t+1.

Case 2. The operation at time t+1 is a fork executed by task T_i, creating k tasks. Let T_j
be a task created by the fork. If T_i is in CL_P(t), then by the update rule for fork, T_j is in
CL_P(t+1). To prove the claim, we must show that AS(T_j) is in CL_Q(t). Because T_i is in
CL_P(t), the block B_z executing initially in T_j is concurrent with block B_x associated with
CL_P. For any task T that exists between the time that B_x finishes and time t+1,
AS(T) ≠ AS(T_j). Otherwise, the assignment AS is not a covering for P. Therefore, AS(T_j) cannot
engage in any operations in Q between the time that B_y associated with CL_Q(t) finishes and
t+1, so that AS(T_j) must be in the concurrency list CL_Q(t+1).

Case 3. The operation at time t+1 is a join. If T_i is added to CL_P(t+1) by the update
rule, then the blocks executing the join operation are all concurrent with B_x associated with
CL_P(t+1). Therefore, AS(T_i) cannot be assigned to any task T existing between the time that
B_x finishes and t+1. It follows that AS(T_i) must be in the concurrency list CL_Q(t+1). □

To complete the proof, we must show that the correspondence of concurrency lists in P
to concurrency lists in Q is 1-1. More precisely, if two concurrency lists CL_Q^1(t) and
CL_Q^2(t) are equal, then the corresponding concurrency lists CL_P^1(t) and CL_P^2(t) are also
equal, so that no two states in Q are merged unless there is a corresponding merger in P. This
result follows from Claim 2 below.

Claim 2. Let CL_Q(t) be the concurrency list associated with CL_P(t). If R in CL_Q(t) is
not assigned to any task T in CL_P(t), then R is not assigned to any task that is currently
executing at time t.

At  time  t  =  0,  the  claim  clearly  is  true. 

Assuming the hypothesis is true at time t, we consider concurrency lists at time t+1. The
claim clearly holds for new concurrency lists associated with incomplete shared variable sets.
Suppose CL_P(t) and CL_Q(t) are complete. By inductive hypothesis, any task R in CL_Q(t) that
is not assigned to a task T in CL_P(t) is unassigned at time t.

Case 1. The operation at time t+1 is a synchronous coordination between T_i and T_j in P.
In the simulation, there is a synchronous coordination between AS(T_i) and AS(T_j). Applying
the update rule to both executions, AS(T_i) is in CL_Q(t+1) iff T_i is in CL_P(t+1). Similarly
for AS(T_j) and T_j. Therefore, the claim holds.

Case 2. The operation at time t+1 is a fork in which T_i creates k tasks in P, and in Q
there is a barrier synchronization sequence among the k+1 tasks assigned to T_i and the k forked
tasks. Consider one of the forked tasks T_j. If T_i is not in CL_P(t), then after the barrier
synchronization operations, AS(T_j) is not in CL_Q(t+1). If T_i is in CL_P(t), then T_j is in
CL_P(t+1). Therefore, if CL_Q(t+1) contains a task R that is assigned to a task T, T is also in
CL_P(t+1).

Case 3. The operation at time t+1 is a join. If a task T_j executes the join operation and
is not in CL_P(t+1), T_j is not currently executing at time t+1. If T_i begins executing after
the join, and is not in CL_P(t+1), then for some T_j that executes the join, T_j is not in
CL_P(t), and AS(T_j) is not in CL_Q(t). After the barrier synchronization operations
corresponding to the join, AS(T_i) is not in CL_Q(t+1). Therefore, AS(T_i) is not in CL_Q(t+1)
unless T_i is in CL_P(t+1). □

Now suppose that concurrency lists CL_Q^1(t) and CL_Q^2(t) are equal, but CL_P^1(t) and
CL_P^2(t) are not. Then there must be a task R that is in CL_Q^2(t) and is assigned to a task in
CL_P^1(t) but to no task in CL_P^2(t) (or vice versa). By Claim 2, R is then not assigned to any
task currently executing at time t, which contradicts its assignment to a task in CL_P^1(t).
But this cannot happen, so that the correspondence must be 1-1. □